@sanderegg sanderegg commented Sep 25, 2025

What do these changes do?

When running in computational mode, the autoscaling service uses the dask client to work out which EC2 instances are needed.

  • In billable systems, the hardware is specified with the task, and autoscaling checks that the task fits.
  • In non-billable systems, the required resources are specified with the task (e.g. CPUs, RAM, ...), and autoscaling finds the most suitable EC2 instance based on those resources.

For non-billable systems:
Until this PR, the autoscaling service estimated the resources available on an EC2 instance from what the AWS EC2 API returns, which is not exactly what Docker or even Dask sees once the EC2 instance is up and running.

This would sometimes create deadlocks where a machine that should in theory handle the task actually could not, since the docker engine and/or the dask worker "sees" slightly less memory or fewer CPUs. This PR corrects this by using the same computations everywhere.
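
To make the mismatch concrete, here is a minimal sketch. The instance names, their sizes and the reserved overhead are made-up illustration values, not figures from AWS or from this PR, and `effective_resources` and `smallest_fitting` are hypothetical helpers rather than functions of the autoscaling service.

```python
from __future__ import annotations

from dataclasses import dataclass

GIB = 1024**3


@dataclass(frozen=True)
class Resources:
    cpus: float
    ram: int  # bytes

    def fits_in(self, other: Resources) -> bool:
        return self.cpus <= other.cpus and self.ram <= other.ram


# Hypothetical catalogue: what the cloud API advertises per instance type
NOMINAL = {
    "small": Resources(cpus=2, ram=4 * GIB),
    "medium": Resources(cpus=4, ram=8 * GIB),
}

# Hypothetical overhead kept back by the OS, the docker engine and the worker
RESERVED = Resources(cpus=0.1, ram=GIB // 2)


def effective_resources(nominal: Resources) -> Resources:
    """What a task can really use once the reserved overhead is subtracted."""
    return Resources(cpus=nominal.cpus - RESERVED.cpus, ram=nominal.ram - RESERVED.ram)


def smallest_fitting(task: Resources, catalogue: dict[str, Resources]) -> str | None:
    """Pick the smallest instance type (by RAM, then CPUs) that can hold the task."""
    candidates = [
        (res.ram, res.cpus, name) for name, res in catalogue.items() if task.fits_in(res)
    ]
    return min(candidates)[2] if candidates else None


task = Resources(cpus=2, ram=4 * GIB)

# Using the advertised numbers, the task appears to fit the small instance...
chosen_nominal = smallest_fitting(task, NOMINAL)
# ...but with the same computation the worker applies, only the medium one fits.
chosen_effective = smallest_fitting(
    task, {name: effective_resources(res) for name, res in NOMINAL.items()}
)
```

In this sketch the first choice would start a machine on which the task can never be scheduled, which is the deadlock described above.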

**computational provider (a.k.a. dask)**
Every dask-worker has a defined number of threads, i.e. the theoretical number of jobs that can run in parallel.
In our implementation, the dask-worker either takes the CPUs it finds, overrides that number with the DASK_NTHREADS environment variable, or uses the DASK_NTHREADS_MULTIPLIER environment variable.
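
A minimal sketch of how such a thread count could be resolved. The precedence (an explicit DASK_NTHREADS wins over the multiplier), the "0 or unset means not overridden" convention and the default multiplier of 1 are assumptions made for illustration; `compute_nthreads` is a hypothetical name, not the function added by this PR.

```python
from __future__ import annotations

import os
from collections.abc import Mapping


def compute_nthreads(cpus: int, env: Mapping[str, str] | None = None) -> int:
    """Resolve the number of worker threads from the CPU count and the environment.

    Assumed rules (not confirmed by the PR text):
      1. DASK_NTHREADS > 0 overrides everything else.
      2. Otherwise the CPU count is multiplied by DASK_NTHREADS_MULTIPLIER (default 1).
    """
    env = os.environ if env is None else env

    nthreads = int(env.get("DASK_NTHREADS", "0"))
    if nthreads > 0:
        return nthreads

    multiplier = int(env.get("DASK_NTHREADS_MULTIPLIER", "1"))
    # Never report fewer than one thread, even on odd inputs
    return max(1, cpus * multiplier)
```

Keeping this in one function that both the worker and the autoscaling service call is one way to get "the same computations everywhere".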

This means that even if a user wants to run 50 tasks requiring 0.1 CPU each and a machine has 20 CPUs, the machine cannot run more than nthreads tasks in parallel.
This PR lets the autoscaling service understand this concept through so-called "generic resources". This also opens the door to GPU support and any other kind of resource.
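
The 50-task example can be sketched as follows. This is a simplified stand-in for the real `Resources` model: only the `generic_resources` field name is taken from the review summary further down, while the `"threads"` key, `can_host` and `count_schedulable` are hypothetical names used for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field

GIB = 1024**3


@dataclass(frozen=True)
class Resources:
    cpus: float
    ram: int  # bytes
    generic_resources: dict[str, float] = field(default_factory=dict)

    def __sub__(self, other: Resources) -> Resources:
        keys = self.generic_resources.keys() | other.generic_resources.keys()
        return Resources(
            cpus=self.cpus - other.cpus,
            ram=self.ram - other.ram,
            generic_resources={
                k: self.generic_resources.get(k, 0) - other.generic_resources.get(k, 0)
                for k in keys
            },
        )

    def can_host(self, task: Resources) -> bool:
        """A task fits only if CPUs, RAM and every generic resource it asks for fit."""
        return (
            task.cpus <= self.cpus
            and task.ram <= self.ram
            and all(
                amount <= self.generic_resources.get(name, 0)
                for name, amount in task.generic_resources.items()
            )
        )


def count_schedulable(instance: Resources, task: Resources, n_tasks: int) -> int:
    """Count how many identical tasks fit on the instance at the same time."""
    free, scheduled = instance, 0
    while scheduled < n_tasks and free.can_host(task):
        free = free - task
        scheduled += 1
    return scheduled


# A 20-CPU machine running a worker with 20 threads
machine = Resources(cpus=20, ram=64 * GIB, generic_resources={"threads": 20})

# CPU and RAM alone would let all 50 small tasks run at once...
small_task = Resources(cpus=0.1, ram=GIB // 10)
# ...but each task also occupies one worker thread.
small_task_with_thread = Resources(cpus=0.1, ram=GIB // 10, generic_resources={"threads": 1})

without_threads = count_schedulable(machine, small_task, 50)
with_threads = count_schedulable(machine, small_task_with_thread, 50)
```

Because generic resources are just named quantities, the same comparison would work for a `"GPU"` key without any change to the model.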

🚨🚨🚨 Some caution is needed on deployment to ensure everything runs as smoothly as possible.

Related issue/s

How to test

Dev-ops

@sanderegg sanderegg added this to the Cheops milestone Sep 25, 2025
@sanderegg sanderegg self-assigned this Sep 25, 2025
@sanderegg sanderegg added a:autoscaling autoscaling service in simcore's stack a:computational clusters labels Sep 25, 2025
codecov bot commented Sep 25, 2025

Codecov Report

❌ Patch coverage is 96.34703% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 87.44%. Comparing base (379e430) to head (f2bcb52).
⚠️ Report is 1 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8423      +/-   ##
==========================================
+ Coverage   87.01%   87.44%   +0.43%     
==========================================
  Files        2011     1604     -407     
  Lines       78602    66901   -11701     
  Branches     1348      761     -587     
==========================================
- Hits        68392    58499    -9893     
+ Misses       9807     8149    -1658     
+ Partials      403      253     -150     
| Flag | Coverage Δ |
|---|---|
| integrationtests | 63.96% <28.57%> (+3.57%) ⬆️ |
| unittests | 85.94% <96.34%> (-0.30%) ⬇️ |

| Components | Coverage Δ |
|---|---|
| pkg_aws_library | 94.98% <100.00%> (+1.37%) ⬆️ |
| pkg_celery_library | ∅ <ø> (∅) |
| pkg_dask_task_models_library | 79.00% <76.92%> (-0.34%) ⬇️ |
| pkg_models_library | ∅ <ø> (∅) |
| pkg_notifications_library | ∅ <ø> (∅) |
| pkg_postgres_database | ∅ <ø> (∅) |
| pkg_service_integration | ∅ <ø> (∅) |
| pkg_service_library | 70.96% <62.50%> (-0.01%) ⬇️ |
| pkg_settings_library | ∅ <ø> (∅) |
| pkg_simcore_sdk | 84.77% <ø> (-0.24%) ⬇️ |
| agent | 93.10% <ø> (ø) |
| api_server | 91.62% <ø> (ø) |
| autoscaling | 95.83% <99.21%> (+0.83%) ⬆️ |
| catalog | 92.06% <ø> (ø) |
| clusters_keeper | 99.14% <ø> (ø) |
| dask_sidecar | 92.38% <ø> (ø) |
| datcore_adapter | 97.95% <ø> (ø) |
| director | 75.81% <ø> (ø) |
| director_v2 | 91.02% <100.00%> (+5.70%) ⬆️ |
| dynamic_scheduler | 96.66% <ø> (ø) |
| dynamic_sidecar | 90.37% <ø> (-0.07%) ⬇️ |
| efs_guardian | 89.83% <ø> (ø) |
| invitations | 90.90% <ø> (ø) |
| payments | 92.80% <ø> (ø) |
| resource_usage_tracker | 92.22% <ø> (ø) |
| storage | 86.56% <ø> (-0.37%) ⬇️ |
| webclient | ∅ <ø> (∅) |
| webserver | 87.05% <66.66%> (-0.02%) ⬇️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

mergify bot commented Sep 25, 2025

🧪 CI Insights

Here's what we observed from your CI run for f2bcb52.

✅ Passed Jobs With Interesting Signals

| Pipeline | Job | Signal | Health on master | Retries |
|---|---|---|---|---|
| CI | system-tests | Base branch is broken, but retries were needed. Could be early signs of flakiness 👀 | Broken | 1 |

@sanderegg sanderegg force-pushed the autoscaling/dask-provider-check-nthreads branch from 1eb7d20 to 9b3ec9f Compare September 25, 2025 11:56
@sanderegg sanderegg requested a review from Copilot September 25, 2025 15:24
Copilot AI left a comment

Pull Request Overview

This PR enhances the computational backend to properly compute and track the number of threads for dask-workers in the autoscaling system. The key changes involve adding support for generic resources (particularly threads) to the Resources model and extending the Dask monitoring configuration.

  • Add generic resources support to the Resources model with proper arithmetic operations
  • Introduce DASK_NTHREADS and DASK_NTHREADS_MULTIPLIER configuration settings
  • Update resource comparison and computation logic to handle the new generic resources

Reviewed Changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.

| File | Description |
|---|---|
| services/clusters-keeper/src/simcore_service_clusters_keeper/data/docker-compose.yml | Adds DASK_NTHREADS environment variables to the compose file |
| services/autoscaling/src/simcore_service_autoscaling/core/settings.py | Introduces new Dask thread configuration settings |
| packages/aws-library/src/aws_library/ec2/_models.py | Extends Resources model with generic_resources field and related operations |
| services/autoscaling/src/simcore_service_autoscaling/modules/dask.py | Adds function to compute instance thread resources |
| services/autoscaling/src/simcore_service_autoscaling/utils/cluster_scaling.py | Updates resource comparison logic |
| services/autoscaling/tests/unit/test_modules_cluster_scaling_computational.py | Refactors resource mapping logic and updates tests |

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

@sanderegg sanderegg force-pushed the autoscaling/dask-provider-check-nthreads branch 3 times, most recently from ec5ef47 to f68a40d Compare September 26, 2025 14:53
@sanderegg sanderegg requested a review from Copilot September 26, 2025 16:28
Copilot AI left a comment

Pull Request Overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 2 comments.


@sanderegg sanderegg force-pushed the autoscaling/dask-provider-check-nthreads branch 4 times, most recently from 8b7db41 to 83bf3b0 Compare October 17, 2025 14:37
@sanderegg sanderegg requested a review from Copilot October 19, 2025 20:29
Copilot AI left a comment

Pull Request Overview

Copilot reviewed 24 out of 24 changed files in this pull request and generated 5 comments.


@sanderegg sanderegg force-pushed the autoscaling/dask-provider-check-nthreads branch from 2c11b7e to 0dcad8e Compare October 20, 2025 11:55
@sanderegg sanderegg requested a review from Copilot October 20, 2025 11:58
Copilot AI left a comment

Pull Request Overview

Copilot reviewed 24 out of 24 changed files in this pull request and generated 7 comments.

Comments suppressed due to low confidence (1)

services/autoscaling/src/simcore_service_autoscaling/modules/dask.py:1

  • Unconditionally injecting a thread resource of 1 per task in both processing and unrunnable lists duplicates logic and may misrepresent tasks that already define a thread-related generic resource. Centralizing this augmentation (e.g. via a helper that only adds the key if absent) reduces duplication and prevents accidental overwrites.
import collections
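
One way to read the suggestion above, as a minimal sketch: a single helper that adds the thread requirement only when the task does not already define one. The `"threads"` key and the name `with_default_thread` are hypothetical; the actual key and call sites in `dask.py` are not shown in this conversation.

```python
from typing import Any

THREADS_RESOURCE = "threads"  # hypothetical key name, for illustration only


def with_default_thread(task_resources: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of the task resources that always carries a thread requirement.

    An existing value is kept as-is; the default of 1 is only added when absent.
    """
    augmented = dict(task_resources)  # do not mutate the caller's mapping
    augmented.setdefault(THREADS_RESOURCE, 1)
    return augmented
```

Both the processing and the unrunnable task lists could then go through the same helper, so the default lives in one place.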

@sanderegg sanderegg requested a review from Copilot October 20, 2025 15:40
Copilot AI left a comment

Pull Request Overview

Copilot reviewed 28 out of 28 changed files in this pull request and generated 5 comments.


@sanderegg sanderegg requested a review from Copilot October 20, 2025 16:07
Copilot AI left a comment

Pull Request Overview

Copilot reviewed 28 out of 28 changed files in this pull request and generated 8 comments.


@sanderegg sanderegg modified the milestones: Cheops, Imparable Oct 21, 2025
@sanderegg sanderegg force-pushed the autoscaling/dask-provider-check-nthreads branch from 32d9139 to cbe908b Compare October 21, 2025 16:29
@sanderegg sanderegg force-pushed the autoscaling/dask-provider-check-nthreads branch from 7f30e5d to f2bcb52 Compare October 24, 2025 14:31
@sanderegg sanderegg merged commit ffe52c1 into ITISFoundation:master Oct 24, 2025
144 of 148 checks passed
@sanderegg sanderegg deleted the autoscaling/dask-provider-check-nthreads branch October 24, 2025 15:32
sanderegg added a commit to sanderegg/osparc-simcore that referenced this pull request Oct 24, 2025
…dask-worker is computed for autoscaling 🚨🚨🚨 (ITISFoundation#8423)"

This reverts commit ffe52c1.
sanderegg added a commit to sanderegg/osparc-simcore that referenced this pull request Oct 24, 2025
…ker is computed for autoscaling 🚨🚨🚨 (ITISFoundation#8423)

Co-authored-by: Copilot <[email protected]>
sanderegg added a commit to sanderegg/osparc-simcore that referenced this pull request Oct 25, 2025
…ker is computed for autoscaling 🚨🚨🚨 (ITISFoundation#8423)

Co-authored-by: Copilot <[email protected]>
sanderegg added a commit to sanderegg/osparc-simcore that referenced this pull request Oct 28, 2025
…ker is computed for autoscaling 🚨🚨🚨 (ITISFoundation#8423)

Co-authored-by: Copilot <[email protected]>
sanderegg added a commit to sanderegg/osparc-simcore that referenced this pull request Oct 28, 2025
…ker is computed for autoscaling 🚨🚨🚨 (ITISFoundation#8423)

Co-authored-by: Copilot <[email protected]>
@matusdrobuliak66 matusdrobuliak66 mentioned this pull request Oct 31, 2025
56 tasks

Development

Successfully merging this pull request may close these issues.

Autoscaling: in non-billable systems the chosen machine type does not take in account the removed resources as the dask-sidecar does

4 participants